Skip to content

fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout#1092

Merged
Wirasm merged 8 commits intodevfrom
archon/task-fix-issue-1067
Apr 12, 2026
Merged

fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout#1092
Wirasm merged 8 commits intodevfrom
archon/task-fix-issue-1067

Conversation

@coleam00
Copy link
Copy Markdown
Owner

@coleam00 coleam00 commented Apr 11, 2026

Summary

  • Problem: Three bugs in v0.3.5: (1) Bun auto-loads CWD .env before user code — non-overlapping keys from target repo leak into Archon process even after the override: true partial fix; (2) archon serve hardcodes skipPlatformAdapters: true, silently preventing all platform adapters (Telegram, Discord, Slack, GitHub) from ever starting; (3) No first-event timeout on Claude SDK query — subprocess wedge causes silent 30-min hang at dag_node_started.
  • Why it matters: Bug 1 causes misconfigured processes (e.g. wrong LOG_LEVEL, leaked tokens) when running Archon from within a target repo. Bug 2 makes archon serve permanently Web-only regardless of token configuration. Bug 3 gives no actionable error when the Claude subprocess fails to start.
  • What changed: Added stripCwdEnv() boot utility in @archon/paths that strips Bun-auto-loaded CWD env keys before any module reads process.env; removed the one-line skipPlatformAdapters: true hardcode from serve.ts; added withFirstMessageTimeout wrapper around the query() call in ClaudeClient with configurable timeout and structured diagnostics; added CLAUDECODE=1 nested-session warning to CLI.
  • What did not change: Platform adapter behavior when tokens are absent (unchanged — adapters self-gate on their own env vars); ~/.archon/.env loading logic; all existing env-leak-gate / SUBPROCESS_ENV_ALLOWLIST mechanisms.

UX Journey

Before

User (in /some/repo with .env containing LOG_LEVEL=debug)

$ archon workflow run assist "hello"
  Bun auto-loads /some/repo/.env -> LOG_LEVEL=debug leaks into process
  ~/.archon/.env loads with override:true (only fixes overlapping keys)
  LOG_LEVEL=debug survives -> noisy debug logs throughout
  If workflow hangs -> silent wait for 30 minutes

$ archon serve  (with TELEGRAM_BOT_TOKEN in ~/.archon/.env)
  startServer({ skipPlatformAdapters: true }) -> Telegram never starts
  Log: "no_platform_adapters_configured" despite token present

After

User (in /some/repo with .env containing LOG_LEVEL=debug)

$ archon workflow run assist "hello"
  stripCwdEnv() strips /some/repo/.env keys BEFORE any module init
  ~/.archon/.env loads cleanly
  LOG_LEVEL from ~/.archon/.env wins -> correct log level
  If workflow hangs -> after 60s: clear error + claude.first_event_timeout log

$ archon serve  (with TELEGRAM_BOT_TOKEN in ~/.archon/.env)
  startServer({ webDistPath, port }) -> Telegram adapter starts normally
  Log: "telegram.adapter_initialized"

$ CLAUDECODE=1 archon workflow run ...
  stderr warning: nested session detected, workaround instructions printed

Architecture Diagram

Before

cli.ts
  dotenv (override:true) <- only fixes overlapping keys
  [CWD .env keys survive in process.env]

serve.ts
  startServer({ skipPlatformAdapters: true })
    Telegram/Discord/Slack/GitHub: NEVER STARTS

claude.ts
  query(prompt, options)
    for await (msg of events)  <- no first-event deadline

After

cli.ts
  import '@archon/paths/strip-cwd-env-boot'  [+]  <- strips ALL CWD .env keys
  dotenv (no override needed)

paths/strip-cwd-env.ts  [+]
  stripCwdEnv(): parses CWD .env files, deletes matching keys from process.env

serve.ts
  startServer({ webDistPath, port })  [~] skipPlatformAdapters removed
    Telegram: starts when TELEGRAM_BOT_TOKEN set
    Discord: starts when DISCORD_BOT_TOKEN set
    Slack: starts when SLACK_BOT_TOKEN set

claude.ts
  withFirstMessageTimeout(gen, controller, timeoutMs, diagnostics)  [+]
    Promise.race(gen.next(), setTimeout(timeoutMs))
  buildFirstEventHangDiagnostics()  [+]  <- structured log for debugging

Connection inventory:

From To Status Notes
cli.ts @archon/paths/strip-cwd-env-boot new First import, side-effect only
strip-cwd-env-boot.ts strip-cwd-env.ts new Boot wrapper -> pure fn
serve.ts startServer() modified Removed skipPlatformAdapters: true
claude.ts withFirstMessageTimeout new Wraps query() generator
claude.ts buildFirstEventHangDiagnostics new Structured diagnostics

Label Snapshot

  • Risk: risk: low
  • Size: size: S
  • Scope: cli, paths, core
  • Module: cli:boot, paths:env, core:claude-client

Change Metadata

  • Change type: bug
  • Primary scope: multi

Linked Issue

Validation Evidence (required)

bun run validate
Type check  - 0 errors (all 9 packages)
Lint        - 0 errors, 0 warnings
Format      - all files clean
Tests       - all passed (paths: 119, core: 800+, cli: 130, all others pass)
  • Evidence provided: full bun run validate run — type-check, lint, format, tests all pass
  • No commands skipped

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? Yes — stripCwdEnv() removes CWD .env keys from process.env (security improvement: prevents target-repo tokens from leaking into Archon process and subprocesses)
  • File system access scope changed? No — stripCwdEnv() only reads (parses without writing) CWD .env files
  • If yes, describe risk and mitigation: Uses dotenv.config({ processEnv: {} }) to parse without re-contaminating; only removes keys appearing in CWD .env files, never touches ~/.archon/.env keys

Compatibility / Migration

  • Backward compatible? Yes — adapters self-gate on their own env vars (no token = no adapter start, same as before)
  • Config/env changes? Yes — new optional env vars: ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS (default: 60000ms) and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING
  • Database migration needed? No

Human Verification (required)

  • Verified scenarios: Full bun run validate suite (type-check, lint, format, tests) passed with 0 failures
  • Edge cases checked: stripCwdEnv() handles missing files, malformed lines, keys absent from process.env; withFirstMessageTimeout tested for normal completion, stuck generator timeout, and error message content
  • What was not verified: Live archon serve with actual Telegram/Slack tokens; actual subprocess hang scenario (environmental — cannot reproduce deterministically in CI)

Side Effects / Blast Radius (required)

  • Affected subsystems: CLI boot sequence, archon serve platform adapter startup, ClaudeClient query loop
  • Potential unintended effects: stripCwdEnv() runs before any module init — if a user intentionally relied on CWD .env keys in the Archon process, they would need to move those keys to ~/.archon/.env (documented intended behavior)
  • Guardrails/monitoring: ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS allows tuning or effectively disabling the timeout; claude.first_event_timeout log event provides structured diagnostics for hang diagnosis

Rollback Plan (required)

  • Fast rollback command/path: git revert dcd392f3
  • Feature flags or config toggles: ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS (set very high to effectively disable timeout); ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING (suppress nested session warning)
  • Observable failure symptoms: If stripCwdEnv() regresses, users see ~/.archon/.env keys missing (auth failures); if platform adapter regression, no_platform_adapters_configured in logs despite tokens being set

Risks and Mitigations

  • Risk: stripCwdEnv() might remove keys user intended Archon process to see
    • Mitigation: Only removes keys parsed from CWD .env files Bun auto-loads. ~/.archon/.env loads after, so Archon config keys are always present. Documented in code comments.
  • Risk: 60s first-event timeout too short for slow networks or cold Claude SDK starts
    • Mitigation: ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS env var allows increasing the timeout. Default 60s is well above the normal 2-3s startup time observed in practice.


Issues resolved

PRs superseded (close on merge)

Related (not fixed here)

Credits

Summary by CodeRabbit

  • New Features

    • Added first-event timeout detection for Claude subprocess hangs (configurable via ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS).
    • Improved environment variable isolation on startup.
    • Added ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING to suppress warnings in specific environments.
  • Bug Fixes

    • Enhanced subprocess environment sanitization.
  • Documentation

    • Updated CLI and security documentation to reflect new startup behavior.
    • Added troubleshooting guide for nested session scenarios.

…t-event timeout (#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes #1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 11, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a7b89d9-c54b-4d5b-b2f2-ed9fa30acf55

📥 Commits

Reviewing files that changed from the base of the PR and between 536584d and 76b49b5.

⛔ Files ignored due to path filters (1)
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (19)
  • .claude/rules/cli.md
  • CLAUDE.md
  • packages/cli/src/cli.ts
  • packages/cli/src/commands/serve.ts
  • packages/core/package.json
  • packages/core/src/clients/claude.test.ts
  • packages/core/src/clients/claude.ts
  • packages/core/src/utils/env-allowlist.test.ts
  • packages/core/src/utils/env-allowlist.ts
  • packages/docs-web/src/content/docs/reference/cli.md
  • packages/docs-web/src/content/docs/reference/configuration.md
  • packages/docs-web/src/content/docs/reference/security.md
  • packages/docs-web/src/content/docs/reference/troubleshooting.md
  • packages/paths/package.json
  • packages/paths/src/env-integration.test.ts
  • packages/paths/src/strip-cwd-env-boot.ts
  • packages/paths/src/strip-cwd-env.test.ts
  • packages/paths/src/strip-cwd-env.ts
  • packages/server/src/index.ts

📝 Walkthrough

Walkthrough

This PR replaces the subprocess environment allowlist security model with CWD environment variable stripping at CLI boot time. It removes SUBPROCESS_ENV_ALLOWLIST and buildCleanSubprocessEnv(), adds a new stripCwdEnv() function to remove Bun auto-loaded .env keys and nested Claude Code markers before module initialization, removes platform adapter skipping from archon serve, and adds first-event timeout detection to prevent silent subprocess hangs.

Changes

Cohort / File(s) Summary
CWD Environment Stripping Module
packages/paths/src/strip-cwd-env.ts, packages/paths/src/strip-cwd-env-boot.ts
New modules implementing stripCwdEnv(cwd) to remove Bun auto-loaded .env keys and nested Claude Code session markers from process.env at boot time, plus a side-effect-only boot entry point for CLI/server startup.
CWD Environment Stripping Tests
packages/paths/src/strip-cwd-env.test.ts, packages/paths/src/env-integration.test.ts
New comprehensive test suites validating .env key stripping, nested Claude Code marker removal, and full env isolation flow including ~/.archon/.env loading and subprocess env passthrough.
Paths Package Exports
packages/paths/package.json
Updated package exports to include ./strip-cwd-env and ./strip-cwd-env-boot; added dotenv runtime dependency.
Removed Environment Allowlist
packages/core/src/utils/env-allowlist.ts, packages/core/src/utils/env-allowlist.test.ts
Deleted SUBPROCESS_ENV_ALLOWLIST constant and buildCleanSubprocessEnv() function; removed corresponding test suite validating allowlist filtering behavior.
Claude Client Subprocess Handling
packages/core/src/clients/claude.ts, packages/core/src/clients/claude.test.ts
Removed env allowlist filtering from buildSubprocessEnv() to pass through all process.env keys directly; added withFirstMessageTimeout() async generator wrapper for first-event timeout detection (default 60s, configurable via ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS); updated tests to verify env passthrough and timeout behavior including diagnostics logging.
CLI and Server Startup
packages/cli/src/cli.ts, packages/server/src/index.ts
Added top-level boot imports (@archon/paths/strip-cwd-env-boot) to strip CWD env before module initialization; updated startup documentation to clarify CWD stripping occurs before ~/.archon/.env loading.
Serve Command Platform Adapters
packages/cli/src/commands/serve.ts
Removed hardcoded skipPlatformAdapters: true option from startServer({...}) call, enabling platform adapters (Telegram/Discord/Slack) to initialize when configured.
Core Package Test Configuration
packages/core/package.json
Removed explicit bun test src/db/workflows.test.ts entry from test script chain.
Documentation Updates
packages/docs-web/src/content/docs/reference/cli.md, .claude/rules/cli.md, CLAUDE.md
Updated startup sequence descriptions to document CWD env stripping step before ~/.archon/.env loading.
Configuration Documentation
packages/docs-web/src/content/docs/reference/configuration.md
Documented new environment variables ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING and ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS; updated .env loading behavior description to reflect CWD stripping + archon env loading model.
Security Documentation
packages/docs-web/src/content/docs/reference/security.md
Replaced subprocess env isolation description from allowlist-based filtering to CWD stripping + trusted archon env source model; updated env-leak gate explanation.
Troubleshooting Documentation
packages/docs-web/src/content/docs/reference/troubleshooting.md
Added new section documenting nested Claude Code session hangs when Archon is launched from within a Claude Code session, with recommended operational guidance and environment variable controls.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI Startup
    participant Boot as `@archon/paths/strip-cwd-env-boot`
    participant StripCwd as stripCwdEnv()
    participant ProcessEnv as process.env
    participant Dotenv as dotenv
    participant App as Application Code

    CLI->>Boot: Execute boot import (first import)
    Boot->>StripCwd: Call stripCwdEnv()
    StripCwd->>ProcessEnv: Remove CWD .env keys<br/>(from Bun auto-load)
    StripCwd->>ProcessEnv: Remove CLAUDECODE marker<br/>& nested Claude Code vars
    StripCwd->>ProcessEnv: Delete NODE_OPTIONS,<br/>VSCODE_INSPECTOR_OPTIONS
    StripCwd-->>Boot: Return (side-effect complete)
    Boot-->>CLI: Boot module loaded
    CLI->>Dotenv: Load ~/.archon/.env<br/>with override: true
    Dotenv->>ProcessEnv: Merge archon config<br/>(wins over inherited vars)
    Dotenv-->>CLI: Env loaded
    CLI->>App: Import remaining modules<br/>(read sanitized process.env)
    App-->>CLI: Ready
Loading
sequenceDiagram
    participant Client as Claude Client
    participant Query as query() SDK
    participant Timeout as withFirstMessageTimeout()
    participant Gen as AsyncGenerator
    participant Timer as setTimeout()
    participant Subprocess as Claude Subprocess

    Client->>Timeout: Call with timeout config<br/>(60s default)
    Timeout->>Timer: Start race timer
    Timeout->>Gen: Call gen.next()
    Gen->>Subprocess: Spawn subprocess
    Subprocess-->>Gen: First event arrives
    Gen-->>Timeout: Yield event
    Timeout->>Timer: Clear timeout (first event received)
    Timeout-->>Client: Return event
    
    Note over Timeout: If timeout fires first:
    Timeout->>Subprocess: Abort controller
    Subprocess-->>Subprocess: Cleanup
    Timeout-->>Client: Throw FirstEventTimeoutError<br/>+ log diagnostics
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 A rabbit hops with joy profound,
At stripping CWD env around,
Boot-time magic clears the path,
Before processes read their wrath,
And timeouts catch the hanging thread,
No more silent futures dread!

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch archon/task-fix-issue-1067

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coleam00
Copy link
Copy Markdown
Owner Author

🔍 Comprehensive PR Review

PR: #1092 — fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout
Reviewed by: 5 specialized agents (code-review, error-handling, test-coverage, comment-quality, docs-impact)
Date: 2026-04-11


Summary

This PR correctly fixes three real bugs from issue #1067. The implementation is focused and well-reasoned — the stripCwdEnv boot pattern is clean, the withFirstMessageTimeout timeout path has exemplary error handling, and the serve.ts one-liner fix is exactly right. Two issues need to be addressed before merge.

Verdict: REQUEST_CHANGES

Severity Count
🔴 CRITICAL 0
🟠 HIGH 2
🟡 MEDIUM 4
🟢 LOW 10

🟠 High Issues (Should Fix Before Merge)

HIGH-1: Dangling setTimeout timer leaks on every successful query

📍 packages/core/src/clients/claude.ts:279-286

withFirstMessageTimeout races gen.next() against a setTimeout, but never cancels the timer when the generator wins (the normal path). The timer remains live for the full timeoutMs (default 60 seconds) after every successful workflow query. In server mode this accumulates one dangling timer per concurrent workflow execution, preventing clean event loop drain on shutdown. In the existing happy-path test, a 5-second timer persists after the test which can interfere with Bun's test runner exit heuristics.

Fix: Store the timer handle and call clearTimeout(timerId) in a finally block after the Promise.race try/catch:

let timerId: ReturnType<typeof setTimeout> | undefined;
let firstValue: IteratorResult<T>;
try {
  firstValue = await Promise.race([
    gen.next(),
    new Promise<never>((_, reject) => {
      timerId = setTimeout(() => reject(new Error('__timeout__')), timeoutMs);
    }),
  ]);
} catch (err) {
  const e = err as Error;
  if (e.message === '__timeout__') {
    controller.abort();
    getLog().error({ ...diagnostics, timeoutMs }, 'claude.first_event_timeout');
    throw new Error(
      'Claude Code subprocess produced no output within ' + timeoutMs + 'ms. ' +
      'See logs for claude.first_event_timeout diagnostic dump. ' +
      'Details: https://github.com/coleam00/Archon/issues/1067'
    );
  }
  throw e;
} finally {
  clearTimeout(timerId);  // always cancel — whether gen won or timer won
}

HIGH-2: Configuration docs directly contradict new env-loading mechanism

📍 packages/docs-web/src/content/docs/reference/configuration.md.env File Locations section

The docs state the CLI loads ~/.archon/.env with override: true "so Archon's own config always wins." This PR removes override: true entirely (replaced by stripCwdEnv() boot stripping). Users and developers who read the docs will rely on a mechanism that no longer exists.

Fix: Update the CLI row in the .env File Locations table:

| **CLI** | `~/.archon/.env` | Global infrastructure config; CWD .env keys stripped before loading (no `override` needed) |

Replace the "How it works" paragraph:

At startup, the CLI strips all keys that Bun auto-loaded from the current working directory (from .env, .env.local, .env.development, .env.production) before loading ~/.archon/.env. This ensures CWD repo keys are fully removed rather than merely overridden. Target repo env vars cannot reach AI subprocesses — SUBPROCESS_ENV_ALLOWLIST blocks all non-whitelisted keys.


🟡 Medium Issues (Consider Fixing)

MEDIUM-1: Direct bun dev:server path still has CWD .env leak

📍 packages/server/src/index.ts:6-12

The PR protects the archon serve path (CLI → startServer()). But packages/server/src/index.ts — used by bun run dev:server — was not updated. It still has a comment saying "No CWD stripping needed" (now contradicted by this PR's motivation). Running bun run dev:server from inside a target repo still hits the original bug.

Fix: Add import '@archon/paths/strip-cwd-env-boot'; as the very first line of packages/server/src/index.ts and remove the stale "No CWD stripping needed" comment block. One line, same pattern as cli.ts.


MEDIUM-2: stripCwdEnv silently ignores unexpected filesystem errors

📍 packages/paths/src/strip-cwd-env.ts:33-38

dotenv.config() returns { error } for both ENOENT (expected — file doesn't exist) and unexpected errors like EACCES (permission denied). All errors are currently discarded silently. A .env file that exists but is unreadable will have its keys skipped, partially re-introducing the #1067 CWD leak with no user feedback.

Fix: Check (result.error as NodeJS.ErrnoException).code !== 'ENOENT' and write a warning to stderr for unexpected errors (consistent with how cli.ts handles .env load failures pre-boot).


MEDIUM-3: Boot module JSDoc over-promises server coverage

📍 packages/paths/src/strip-cwd-env-boot.ts:3

The docstring says "Import this as the FIRST import in CLI and server entry points." The server entry point was deliberately not updated. A future developer may incorrectly add this import to the server based on this comment.

Fix: Change "CLI and server entry points" → "CLI entry points" (trivial).


MEDIUM-4: ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS env var is undocumented

📍 packages/docs-web/src/content/docs/reference/configuration.md

The PR introduces a user-configurable env var to override the 60-second first-event timeout. Users on slow hardware who hit the timeout have no documented escape hatch.

Fix: Add to the AI Providers → Claude env var table:

| `ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS` | Timeout (ms) before Claude subprocess is considered hung | `60000` |

🟢 Low Issues

View 10 low-priority suggestions
# Issue Location Suggestion
L1 Happy-path test timeoutMs: 5000 leaves a dangling timer claude.test.ts:1113 Fixed automatically by HIGH-1 fix
L2 controller.signal.aborted not asserted in timeout test claude.test.ts Add expect(controller.signal.aborted).toBe(true)
L3 firstValue.done === true branch untested claude.ts:303 Add test with an immediately-completing generator
L4 Distinct keys across multiple .env files not tested strip-cwd-env.test.ts Add test with KEY_A in .env and KEY_B in .env.local
L5 getFirstEventTimeoutMs env-var override path untested claude.ts:233-241 Optional — function is private; risk is low
L6 stripCwdEnv JSDoc refers to override: true the CLI no longer uses strip-cwd-env.ts:24 Change to "loaded afterward by each entry point"
L7 .claude/rules/cli.md Startup Behavior still says override: true .claude/rules/cli.md Update to reflect new boot sequence
L8 ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING undocumented configuration.md Add to Core env var table
L9 CLAUDE.md @archon/paths description incomplete CLAUDE.md Add stripCwdEnv/strip-cwd-env-boot; note dotenv as allowed external dep
L10 No troubleshooting entry for nested Claude Code session hang troubleshooting.md Add section with CLAUDECODE=1 warning + workaround env vars

✅ What's Good

  • withFirstMessageTimeout timeout path is exemplary: calls controller.abort(), emits structured log.error with full diagnostic payload (env key names but not values), and throws with the GitHub issue URL for discoverability. This is exactly CLAUDE.md's "fail fast + explicit errors" pattern.
  • processEnv: {} trick for safe key collection: parses dotenv files without writing to process.env, then explicitly deletes only the matched keys. Clean separation of discovery and deletion.
  • __timeout__ sentinel pattern: simple and effective way to distinguish timeout from real generator errors without adding an extra error class.
  • Boot import ordering: @archon/paths/strip-cwd-env-boot is confirmed as the very first import in cli.ts (line 12), before parseArgs, config, resolve, or existsSync. Critical correctness constraint met.
  • strip-cwd-env.test.ts behavioral coverage: 6 tests covering malformed lines, missing file, multi-file, key-not-in-env no-op, and preservation of non-CWD keys — all testing observable outcomes.
  • One-line serve.ts fix: removing skipPlatformAdapters: true is clean with no side effects since adapters self-gate on token presence.
  • getFirstEventTimeoutMs() defensive parsing: validates Number.isFinite(parsed) && parsed > 0; safe fallback to 60s default on invalid input.
  • ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING escape hatch: well-named, follows existing env-var naming conventions, and the warning message itself advertises it.

Suggested Follow-up Issues

Title Priority
"Fix CWD .env leak in direct bun dev:server path (server/src/index.ts)" P2 (if MEDIUM-1 not fixed here)
"Add troubleshooting guide for nested Claude Code session hang (CLAUDECODE=1 warning)" P3

Reviewed by Archon comprehensive-pr-review workflow — 5 specialized agents
Full artifacts: ~/.archon/workspaces/coleam00/Archon/artifacts/runs/13050cf1201ed061a931279a5a35f648/review/

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coleam00
Copy link
Copy Markdown
Owner Author

⚡ Self-Fix Report (Aggressive)

Status: COMPLETE
Pushed: ✅ Changes pushed to `archon/task-fix-issue-1067`
Commit: b8f9206
Philosophy: Fix everything unless clearly a new concern


Fixes Applied (15 total)

Severity Count
🔴 CRITICAL 0
🟠 HIGH 2
🟡 MEDIUM 4
🟢 LOW 9
View all fixes
  • Dangling setTimeout timer (packages/core/src/clients/claude.ts:270) — Added timerId + finally { clearTimeout(timerId) } block; eliminates event-loop leak on every successful query
  • Stale docs contradict env-loading (packages/docs-web/src/content/docs/reference/configuration.md:299-303) — Updated CLI table row and "How it works" paragraph to describe stripCwdEnv boot stripping
  • bun dev:server CWD env leak (packages/server/src/index.ts:6) — Added import '@archon/paths/strip-cwd-env-boot' as first import; removed stale "No CWD stripping needed" comment
  • stripCwdEnv silent FS error (packages/paths/src/strip-cwd-env.ts:33) — Added process.stderr.write warning for non-ENOENT errors
  • Boot module JSDoc over-promise (packages/paths/src/strip-cwd-env-boot.ts:3) — Changed "CLI and server entry points" → "CLI entry points"
  • ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS undocumented — Added to AI Providers — Claude table in configuration.md
  • ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING undocumented — Added to Core env vars table in configuration.md
  • Stale override: true in stripCwdEnv JSDoc — Updated to "loaded afterward by each entry point"
  • Stale startup behavior in .claude/rules/cli.md — Updated Startup Behavior section with 4-step new boot sequence
  • CLAUDE.md @archon/paths description incomplete — Added stripCwdEnv/strip-cwd-env-boot and dotenv dep
  • No troubleshooting for nested Claude Code hang — Added full section with cause, fix, and both new env vars
  • timeoutMs: 5000 in happy-path test — Reduced to 50 (timer leak fixed; shorter value keeps tests fast)
  • controller.abort() not asserted in timeout test — Added expect(controller.signal.aborted).toBe(true)
  • firstValue.done === true branch not tested — Added test for immediately-completing generator
  • Overlapping keys across .env files not tested — Added test with distinct keys in .env and .env.local

Tests Added

  • withFirstMessageTimeout — aborts the controller when timeout fires
  • withFirstMessageTimeout — handles generator that completes immediately without yielding
  • stripCwdEnv — strips distinct keys from different .env files

Skipped (0)

(none — all findings addressed)


Validation

✅ Type check | ✅ Lint | ✅ Tests (all pass)


Self-fix by Archon · aggressive mode · fixes pushed to archon/task-fix-issue-1067

…MessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coleam00
Copy link
Copy Markdown
Owner Author

Archon PR Validation Report

Verdict: APPROVE

Summary

All three bugs (CWD .env key leak, skipPlatformAdapters hardcode, missing first-event timeout) are confirmed on main and correctly fixed on the feature branch. Fixes are minimal, well-tested, and follow project conventions. No regressions or blocking issues found.

Bug Confirmation

Claim Main Feature
CWD .env non-overlapping key leak Confirmed Fixed — stripCwdEnv() boot wrapper
archon serve hardcodes skipPlatformAdapters: true Confirmed Fixed — one-line removal
No first-event timeout on query() Confirmed Fixed — withFirstMessageTimeout() (60s default)

Issues

No blocking issues found.

What's Done Well

  • Boot ordering separates side-effect from testable pure function
  • processEnv: {} prevents re-contamination during parsing
  • Typed FirstEventTimeoutError sentinel with structured diagnostics
  • Comprehensive test coverage (12 new tests)
  • Documentation and CLAUDE.md updated

Validated by archon-validate-pr workflow

@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 12, 2026

Review from #1068 / #1071 authors

We authored the original PRs (#1068 and #1071) that this consolidates. The consolidation is solid — stripCwdEnv() with config({ processEnv: {} }) is cleaner than our parse(readFileSync(...)), exported withFirstMessageTimeout is more testable, and the docs additions are welcome.

We'll take over from here and push fixes for the issues below directly to this branch.

Issues to address

  1. dotenv version mismatchpackages/paths/package.json uses "dotenv": "^16" but the rest of the monorepo (packages/cli/package.json) uses "^17.2.3". Will align to ^17.

  2. Docs claim about SUBPROCESS_ENV_ALLOWLIST is factually wrongconfiguration.md says "Target repo env vars cannot reach AI subprocesses — SUBPROCESS_ENV_ALLOWLIST blocks all non-whitelisted keys." Per fix: unblock Archon when launched from a Claude Code terminal and on Bun 1.3+ #1097's finding, the Claude Agent SDK leaks process.env into the spawned child regardless of the env option, so the allowlist does NOT actually block vars from reaching the subprocess. Will correct the docs.

  3. No CLAUDECODE warning in server entry pointcli.ts warns but server/src/index.ts doesn't. Will add for consistency.

  4. Tests don't verify diagnostic dump contentwithFirstMessageTimeout tests check throw + abort but not the claude.first_event_timeout log payload. Will add.

  5. Missing fix: unblock Archon when launched from a Claude Code terminal and on Bun 1.3+ #1097 integration (the actual hang fix) — The CLAUDECODE + CLAUDE_CODE_* markers from the parent shell are still in process.env. Since the SDK bypasses the env option, these reach the subprocess and cause the nested-session deadlock. withFirstMessageTimeout catches the symptom (60s timeout), but fix: unblock Archon when launched from a Claude Code terminal and on Bun 1.3+ #1097's process-level deletion prevents the hang entirely. Will integrate by stripping CLAUDE_CODE_* markers (except auth vars) from process.env at entry points — pattern-matched, not hardcoded.

Related PRs: #1068, #1071, #1097

Wirasm added 4 commits April 12, 2026 11:40
…marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate #1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per #1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).
…only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see #1097)
- It attracted a constant stream of PRs adding keys (#1060, #1093, #1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.
The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip
@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 12, 2026

PR Review Summary — Multi-Agent Review

Reviewed by 6 specialized agents: code-reviewer, docs-impact, test-analyzer, silent-failure-hunter, type-design-analyzer, code-simplifier.


Critical Issues (2 found)

Agent Issue Location
code-reviewer (97%) CLAUDECODE warning is dead code. strip-cwd-env-boot (first import) deletes CLAUDECODE from process.env before the warning check process.env.CLAUDECODE === '1' runs. The warning will never fire. Fix: emit the warning inside strip-cwd-env-boot.ts before/during the deletion, or snapshot the value before stripping. cli.ts:34, server/index.ts:44
silent-failure-hunter useGlobalAuth token stripping silently removed. The old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY from subprocess env when useGlobalAuth=true. That stripping was deleted in this refactor. useGlobalAuth is still computed and logged but never acted on — auth tokens always pass through now regardless of config. This is a behavioral regression from the stated intent of CLAUDE_USE_GLOBAL_AUTH=true. claude.ts:119-145

Important Issues (5 found)

Agent Issue Location
code-reviewer (90%), docs-impact cli.md and configuration.md say "no override needed" but cli.ts:24 uses override: true. Future agents/developers following the docs will omit override, causing shell-inherited vars to win over ~/.archon/.env. .claude/rules/cli.md:33, configuration.md:301
code-reviewer (82%) Unsound type cast: options.env as Record<string, string> drops undefined from NodeJS.ProcessEnv. Change buildFirstEventHangDiagnostics to accept NodeJS.ProcessEnv directly. claude.ts:526
silent-failure-hunter stripCwdEnv throws uncaught in boot module. If process.cwd() throws (deleted directory) or dotenv faults, the process crashes at import time with no user message. Wrap in try/catch with stderr fallback. strip-cwd-env-boot.ts:13
silent-failure-hunter getFirstEventTimeoutMs silently ignores invalid env values. Setting ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS=0 or abc silently falls back to 60s with no warning. Users get no indication their config was ignored. claude.ts:193-200
silent-failure-hunter Generator not closed on timeout. gen.return() is never called when timeout fires — the pending gen.next() promise and generator are leaked. Add void gen.return(undefined).catch(() => {}) in the timeout catch path. claude.ts:250-262

Suggestions (8 found)

Agent Suggestion Location
code-simplifier Dead diagnostic fields. claudeCode and claudeCodeEntrypoint in buildFirstEventHangDiagnostics are always undefined post-stripping. Remove them; rename parentClaudeKeys to parentAnthropicKeys. claude.ts:209-218
code-simplifier Duplicated warning blocks in cli.ts and server/index.ts could share a warnIfNestedSession() helper in strip-cwd-env.ts. cli.ts:34-42, index.ts:44-52
type-analyzer Record<string, unknown> too loose for diagnostics param. Extract a named FirstEventDiagnostics interface. claude.ts:203, 236
test-analyzer (6/10) getFirstEventTimeoutMs env var override is untested. Add tests for valid override, invalid value fallback. claude.ts:193-200
test-analyzer (5/10) Non-timeout error propagation untested in withFirstMessageTimeout. Verify SDK errors pass through unmodified. claude.ts:250-263
docs-impact .env.example missing the two new env vars (ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS, ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING). .env.example
docs-impact cli-internals.md flow diagram missing the strip-cwd-env-boot step. cli-internals.md:42
docs-impact CHANGELOG.md v0.3.4 references removed SUBPROCESS_ENV_ALLOWLIST. CHANGELOG.md:30

Strengths

  • stripCwdEnv() is well-designed: clean separation of CWD stripping and marker removal, proper ENOENT handling, auth var preservation via CLAUDE_CODE_AUTH_VARS set
  • FirstEventTimeoutError class is well-encapsulated — private, used only for local discrimination, instanceof is unambiguous
  • Integration tests (env-integration.test.ts) are excellent — simulateEntryPointFlow faithfully reproduces the entry-point sequence with real DATABASE_URL leak scenarios
  • withFirstMessageTimeout timer cleanup is correct (finally block calls clearTimeout on all paths)
  • Deleted env-allowlist.test.ts appropriately — no dead tests for removed functionality

Documentation Issues

  • .claude/rules/cli.md — "no override needed" contradicts actual code (override: true)
  • configuration.md CLI row — same factual error
  • cli-internals.md — diagram missing strip-cwd-env-boot step
  • .env.example — missing two new env vars
  • CHANGELOG.md v0.3.4 — stale SUBPROCESS_ENV_ALLOWLIST reference

Verdict: NEEDS FIXES

Two critical issues must be addressed before merge:

  1. The nested-session warning is inert — CLAUDECODE is stripped before the check runs
  2. useGlobalAuth=true no longer strips auth tokens from subprocess env (silent behavioral regression)

Recommended Actions

  1. Fix critical: Move warning emission into strip-cwd-env-boot.ts (before CLAUDECODE is deleted)
  2. Fix critical: Restore token stripping in buildSubprocessEnv when useGlobalAuth=true
  3. Fix important: Correct "no override needed" in cli.md and configuration.md
  4. Fix important: Accept NodeJS.ProcessEnv in buildFirstEventHangDiagnostics (remove cast)
  5. Fix important: Add try/catch guard in strip-cwd-env-boot.ts
  6. Fix important: Log warning for invalid ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS values
  7. Fix important: Call gen.return() on timeout path
  8. Address suggestions as time permits

…uth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).
prospapledge88 pushed a commit to prospapledge88/Archon that referenced this pull request Apr 14, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
@Wirasm Wirasm deleted the archon/task-fix-issue-1067 branch April 27, 2026 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment